Integrating high dimensional bi-directional parsing models for gene mention tagging
نویسندگان
چکیده
MOTIVATION Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger participated in BioCreative 2 challenge and analyze what contributes to its good performance. Our tagger is based on the conditional random fields model (CRF), the most prevailing method for the gene mention tagging task in BioCreative 2. Our tagger is interesting because it accomplished the highest F-scores among CRF-based methods and second over all. Moreover, we obtained our results by mostly applying open source packages, making it easy to duplicate our results. RESULTS We first describe in detail how we developed our CRF-based tagger. We designed a very high dimensional feature set that includes most of information that may be relevant. We trained bi-directional CRF models with the same set of features, one applies forward parsing and the other backward, and integrated two models based on the output scores and dictionary filtering. One of the most prominent factors that contributes to the good performance of our tagger is the integration of an additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and bi-directional parsing models will produce the same results. We show that due to different feature settings, a CRF model can be asymmetric and the feature setting for our tagger in BioCreative 2 not only produces different results but also gives backward parsing models slight but constant advantage over forward parsing model. To fully explore the potential of integrating bi-directional parsing models, we applied different asymmetric feature settings to generate many bi-directional parsing models and integrate them based on the output scores. Experimental results show that this integrated model can achieve even higher F-score solely based on the training corpus for gene mention tagging. AVAILABILITY Data sets, programs and an on-line service of our gene mention tagger can be accessed at http://aiia.iis.sinica.edu.tw/biocreative2.htm.
منابع مشابه
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
متن کاملبررسی مقایسهای تأثیر برچسبزنی مقولات دستوری بر تجزیه در پردازش خودکار زبان فارسی
In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...
متن کاملHigh-Recall Gene Mention Recognition by Unification of Multiple Backward Parsing Models
We considered the gene mention tagging task as a classification problem and applied support vector machines (SVM) to solve it. We selected a large set of features as the input and trained two SVM models with different multiclass extension methods. We found that backward parsing constantly outperformed forward parsing regardless of the multiclass extension methods and obtained high precision rat...
متن کاملBidirectional Sequence Labeling via Dual Decomposition
In this paper, we propose a bidirectional algorithm for sequence labeling to capture the influence of both the left-to-right and the right-to-left directions. We combine the optimization of two unidirectional models from opposite directions via the dual decomposition method to jointly label the input sequence. Experiments on three sequence labeling tasks (Chinese word segmentation, English POS ...
متن کاملTwo optimal algorithms for finding bi-directional shortest path design problem in a block layout
In this paper, Shortest Path Design Problem (SPDP) in which the path is incident to all cells is considered. The bi-directional path is one of the known types of configuration of networks for Automated Guided Vehi-cles (AGV).To solve this problem, two algorithms are developed. For each algorithm an Integer Linear Pro-gramming (ILP) is determined. The objective functions of both algorithms are t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره 24 شماره
صفحات -
تاریخ انتشار 2008